Multi-level system memory with near memory capable of storing compressed cache lines

ABSTRACT

A method is described. The method includes receiving a read or write request for a cache line. The method includes directing the request to a set of logical super lines based on the cache line's system memory address. The method includes associating the request with a cache line of the set of logical super lines. The method includes, if the request is a write request: compressing the cache line to form a compressed cache line, breaking the cache line down into smaller data units and storing the smaller data units into a memory side cache. The method includes, if the request is a read request: reading smaller data units of the compressed cache line from the memory side cache and decompressing the cache line.

RELATED CASES

This application is a continuation of and claims the benefit of U.S. patent application Ser. No. 15/717,939, entitled, “MULTI-LEVEL SYSTEM MEMORY WITH NEAR MEMORY CAPABLE OF STORING COMPRESSED CACHE LINES”, filed Sep. 28, 2017, which is incorporated by reference in its entirety.

FIELD OF INVENTION

The field of invention pertains generally to the computing sciences, and, more specifically, to a multi-level system memory with near memory capable of storing compressed cache lines.

BACKGROUND

A pertinent issue in many computer systems is the system memory (also referred to as “main memory”). Here, as is understood in the art, a computing system operates by executing program code stored in system memory and reading/writing data that the program code operates on from/to system memory. As such, system memory is heavily utilized with many program code and data reads as well as many data writes over the course of the computing system's operation. Finding ways to improve system memory accessing performance is therefore a motivation of computing system engineers.

FIGURES

A better understanding of the present invention can be obtained from the following detailed description in conjunction with the following drawings, in which:

FIG. 1 shows a computing system with a multi-level system memory;

FIGS. 2a, 2b, 2c, 2d, 2e and 2f pertain to write behavior of an improved multi-level system memory subsystem;

FIG. 3 pertains to read behavior of an improved multi-level system memory subsystem;

FIGS. 4a, 4b, 4c and 4d pertain to hardware embodiments for both a memory controller and near memory of an improved multi-level system memory subsystem;

FIG. 5 pertains to a methodology of an improved multi-level system memory subsystem;

FIG. 6 shows a computing system.

DETAILED DESCRIPTION

1.0 Multi-Level System Memory

Recall from the Background discussion that system designers seek to improve system memory performance. One of the ways to improve system memory performance is to have a multi-level system memory. FIG. 1 shows an embodiment of a computing system 100 having a multi-tiered or multi-level system memory 112. According to various embodiments, a smaller, faster near memory 113 may be utilized as a cache for a larger, slower far memory 114.

In the case where near memory 113 is used as a cache, near memory 113 is used to store an additional copy of those data items in far memory 114 that are expected to be more frequently used by the computing system. By storing the more frequently used items in near memory 113, the system memory 112 will be observed as faster because the system will often read/write from/to items that are being stored in faster near memory 113. For an implementation using a write-back technique, the copy of data items in near memory 113 may contain data that has been updated by the CPU, and is thus more up-to-date than the data in far memory 114. The process of writing back ‘dirty’ cache entries to far memory 114 ensures that such changes are preserved in non-volatile far memory 114.

According to various embodiments, near memory cache 113 has lower access times and/or higher bandwidth than the lower tiered far memory 114. For example, the near memory 113 may exhibit reduced access times by having a faster clock speed than the far memory 114. Here, the near memory 113 may be a faster (e.g., lower access time), volatile system memory technology (e.g., high performance dynamic random access memory (DRAM) and/or SRAM memory cells) co-located with the memory controller 116. By contrast, far memory 114 may be either a volatile memory technology implemented with a slower clock speed (e.g., a DRAM component that receives a slower clock) or, e.g., a non-volatile memory technology that is slower (e.g., longer access time) than volatile/DRAM memory or whatever technology is used for near memory.

For example, far memory 114 may be comprised of an emerging non-volatile random access memory technology such as, to name a few possibilities, a phase change based memory, a three dimensional crosspoint memory, “write-in-place” non-volatile main memory devices, memory devices having storage cells composed of chalcogenide, multiple level flash memory, multi-threshold level flash memory, a ferro-electric based memory (e.g., FRAM), a magnetic based memory (e.g., MRAM), a spin transfer torque based memory (e.g., STT-RAM), a resistor based memory (e.g., ReRAM), a Memristor based memory, universal memory, Ge2Sb2Te5 memory, programmable metallization cell memory, amorphous cell memory, Ovshinsky memory, etc. Any of these technologies may be byte addressable so as to be implemented as a main/system memory in a computing system rather than traditional block or sector based non-volatile mass storage.

Emerging non-volatile random access memory technologies typically have some combination of the following: 1) higher storage densities than DRAM (e.g., by being constructed in three-dimensional (3D) circuit structures (e.g., a crosspoint 3D circuit structure)); 2) lower power consumption densities than DRAM (e.g., because they do not need refreshing); and/or, 3) access latency that is slower than DRAM yet still faster than traditional non-volatile memory technologies such as FLASH. The latter characteristic in particular permits various emerging non-volatile memory technologies to be used in a main system memory role rather than a traditional mass storage role (which is the traditional architectural location of non-volatile storage).

Regardless of whether far memory 114 is composed of a volatile or non-volatile memory technology, in various embodiments far memory 114 acts as a true system memory in that it supports finer grained data accesses (e.g., cache lines) rather than only larger based “block” or “sector” accesses associated with traditional, non-volatile mass storage (e.g., solid state drive (SSD), hard disk drive (HDD)), and/or, otherwise acts as a byte addressable memory that the program code being executed by processor(s) of the CPU operates out of.

Because near memory 113 acts as a cache, near memory 113 may not have formal addressing space. Rather, in some cases, far memory 114 defines the individually addressable memory space of the computing system's main memory. In various embodiments near memory 113 acts as a cache for far memory 114 rather than acting as a last level CPU cache. Generally, a CPU cache is optimized for servicing CPU transactions. By contrast, a memory side cache is designed to handle, e.g., all accesses directed to system memory, irrespective of whether they arrive from the CPU, from a Peripheral Control Hub, a display controller or some component other than the CPU.

In various embodiments, system memory may be implemented with dual in-line memory module (DIMM) cards where a single DIMM card has both volatile (e.g., DRAM) and (e.g., emerging) non-volatile memory semiconductor chips disposed on it. In an embodiment, the DRAM chips effectively act as an on board cache for the non-volatile memory chips on the DIMM card. Ideally, the more frequently accessed cache lines of any particular DIMM card will be accessed from that DIMM card's DRAM chips rather than its non-volatile memory chips. Given that multiple DIMM cards may be plugged into a working computing system and each DIMM card is only given a section of the system memory addresses made available to the processing cores 117 of the semiconductor chip that the DIMM cards are coupled to, the DRAM chips are acting as a cache for the non-volatile memory that they share a DIMM card with rather than as a last level CPU cache.

In other configurations DIMM cards having only DRAM chips may be plugged into a same system memory channel (e.g., a double data rate (DDR) channel) with DIMM cards having only non-volatile system memory chips. Ideally, the more frequently used cache lines of the channel are in the DRAM DIMM cards rather than the non-volatile memory DIMM cards. Thus, again, because there are typically multiple memory channels coupled to a same semiconductor chip having multiple processing cores, the DRAM chips are acting as a cache for the non-volatile memory chips that they share a same channel with rather than as a last level CPU cache.

In yet other possible configurations or implementations, a DRAM device on a DIMM card can act as a memory side cache for a non-volatile memory chip that resides on a different DIMM and is plugged into a same or different channel than the DIMM having the DRAM device. Although the DRAM device may potentially service the entire system memory address space, entries into the DRAM device are based in part from reads performed on the non-volatile memory devices and not just evictions from the last level CPU cache. As such the DRAM device can still be characterized as a memory side cache.

In another possible configuration, a memory device such as a DRAM device functioning as near memory 113 may be assembled together with the memory controller 116 and processing cores 117 onto a single semiconductor device (e.g., as embedded DRAM) or within a same semiconductor package (e.g., stacked on a system-on-chip that contains, e.g., the CPU, memory controller, peripheral control hub, etc.). Far memory 114 may be formed by other devices, such as slower DRAM or non-volatile memory and may be attached to, or integrated in the same package as well. Alternatively, far memory may be external to a package that contains the CPU cores and near memory devices. A far memory controller may also exist between the main memory controller and far memory devices. The far memory controller may be integrated within a same semiconductor chip package as CPU cores and a main memory controller, or, may be located outside such a package (e.g., by being integrated on a DIMM card having far memory devices).

In still other embodiments, at least some portion of near memory 113 has its own system address space apart from the system addresses that have been assigned to far memory 114 locations. In this case, the portion of near memory 113 that has been allocated its own system memory address space acts, e.g., as a higher priority level of system memory (because it is faster than far memory) rather than as a memory side cache. In other or combined embodiments, some portion of near memory 113 may also act as a last level CPU cache.

In various embodiments when at least a portion of near memory 113 acts as a memory side cache for far memory 114, the memory controller 116 and/or near memory 113 may include local cache information (hereafter referred to as “Metadata”) 120 so that the memory controller 116 can determine whether a cache hit or cache miss has occurred in near memory 113 for any incoming memory request.

In the case of an incoming write request, if there is a cache hit, the memory controller 116 writes the data (e.g., a 64-byte CPU cache line or portion thereof) associated with the request directly over the cached version in near memory 113. Likewise, in the case of a cache miss, in an embodiment, the memory controller 116 also writes the data associated with the request into near memory 113 which may cause the eviction from near memory 113 of another cache line that was previously occupying the near memory 113 location where the new data is written to. However, if the evicted cache line is “dirty” (which means it contains the most recent or up-to-date data for its corresponding system memory address), the evicted cache line will be written back to far memory 114 to preserve its data content.

In the case of an incoming read request, if there is a cache hit, the memory controller 116 responds to the request by reading the version of the cache line from near memory 113 and providing it to the requestor. By contrast, if there is a cache miss, the memory controller 116 reads the requested cache line from far memory 114 and not only provides the cache line to the requestor (e.g., a CPU) but also writes another copy of the cache line into near memory 113.
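The write and read handling just described can be illustrated with a minimal Python sketch. It assumes a direct mapped organization, whole cache line accesses, and a dictionary standing in for far memory; the class and method names (`NearMemoryCache`, `_install`) are hypothetical and are not taken from any embodiment.

```python
class NearMemoryCache:
    """Toy direct mapped, write-back memory side cache (illustrative only)."""

    def __init__(self, num_slots, far_memory):
        self.num_slots = num_slots
        self.far = far_memory      # dict: address -> data, stands in for far memory
        self.slots = {}            # slot index -> (address, data, dirty)

    def _install(self, addr, data, dirty):
        idx = addr % self.num_slots
        old = self.slots.get(idx)
        # A dirty victim holding a different address is written back to far memory.
        if old is not None and old[0] != addr and old[2]:
            self.far[old[0]] = old[1]
        self.slots[idx] = (addr, data, dirty)

    def write(self, addr, data):
        # On a hit or a miss the new data lands in near memory and is marked dirty.
        self._install(addr, data, dirty=True)

    def read(self, addr):
        entry = self.slots.get(addr % self.num_slots)
        if entry is not None and entry[0] == addr:
            return entry[1]        # cache hit: serve from near memory
        data = self.far[addr]      # cache miss: fetch from far memory
        self._install(addr, data, dirty=False)
        return data
```

In this sketch a write never touches far memory directly; far memory is only updated when a dirty line is evicted.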

In general, cache lines may be written to and/or read from near memory and/or far memory at different levels of granularity (e.g., writes and/or reads only occur at cache line granularity (and, e.g., byte addressability for writes and/or reads is handled internally within the memory controller), byte granularity (e.g., true byte addressability in which the memory controller writes and/or reads only an identified one or more bytes within a cache line), or granularities in between). Additionally, note that the size of the cache line maintained within near memory and/or far memory may be larger than the cache line size maintained by CPU level caches.

Different types of near memory caching implementation possibilities exist. Examples include direct mapped, set associative and fully associative. Depending on implementation, the ratio of near memory cache slots to far memory addresses that map to the near memory cache slots may be configurable or fixed.

Some systems may use super lines. Super lines are composed of multiple cache lines. For example, in the case of cache lines that are 64 bytes, a super line may be composed of 16, 32 or 64 different, usually adjacent, 64 byte cache lines. Super lines help reduce metadata overhead, since the cache line tag is associated with each super line instead of being associated with each smaller (e.g., 64B) cache line.
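The metadata saving can be made concrete with a small calculation. The following Python sketch counts how many tags a cache of a given size needs; the 1 GiB near memory capacity is an assumed figure chosen only for illustration, and `tag_entries` is a hypothetical helper name.

```python
def tag_entries(cache_bytes, line_bytes=64, lines_per_super_line=1):
    """Number of tags needed when one tag covers one (super) line."""
    return cache_bytes // (line_bytes * lines_per_super_line)

NEAR_MEMORY_BYTES = 1 << 30   # assumed 1 GiB near memory, for illustration only

per_line_tags = tag_entries(NEAR_MEMORY_BYTES)
super_line_tags = tag_entries(NEAR_MEMORY_BYTES, lines_per_super_line=32)
print(per_line_tags, super_line_tags)
```

Under these assumptions, tagging at 32-line super line granularity needs 32 times fewer tags than tagging each 64 byte cache line.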

2.0 Compressed Near Memory Cache with Logical vs. Physical Cache Super Lines

FIGS. 2a through 2f show an efficient near memory caching approach in which the concept of a super line is retained logically, but physical accesses between the system memory controller 201 and near memory 203 remain at the cache line level or even at smaller increments of a cache line to achieve compression and low power consumption.

FIG. 2a shows a stream of write requests 210 for particular system memory cache lines. For ease of discussion, the example of FIG. 2a assumes that near memory 203 is empty so there is room in the near memory cache for each of the write requests.

Here, the near memory controller 202 keeps a tag array 205 (or similar information) that identifies a particular pair of “logical” super lines 206. As will be made clearer in the following discussion, the pair of super lines 206 are associated with a single, physical super line's worth of memory capacity 204 in near memory cache 203. Ideally, should write requests be received for every cache line slot of both logical super lines 206, they will be able to be compressed by compression logic 208 in the near memory controller 202 and stored in the single super line of cache space 204 in near memory 203.

That is, if on average every memory write request cache line that is associated with logical super line pair 206 can be compressed to half its size, two super lines worth of information can be cached into a single super line's worth of information 204 in near memory cache 203. It is pertinent to point out that the mapping of two logical super lines 206 into one super line of physical cache resources 204 is only exemplary and that other ratios of number of logical super lines to number of physical near memory super line spaces are possible (e.g., 3:1, 4:1, 3:2, 8:1, etc.).

Referring to FIG. 2b, the write request for the first cache line 211 is processed. Here, the system memory address of the first cache line is mapped by the tag array 205 to the pair of logical super lines 206. Here, there are many pairs of logical super lines that map to a single super line in near memory cache, where every pair of logical super lines is unique (maps to a unique set of system memory cache line addresses) and maps to a unique (different) super line in near memory cache. As such, there are many pairs of logical super lines that respectively map to many different physical super lines in near memory cache 203. For ease of illustration, FIG. 2b only shows one pair of logical super lines 206 and the particular physical super line's worth of cache resources 204 that the pair 206 maps to.

As such, continuing with a discussion of the write processing for the first cache line's write request 211, upon receipt of the write request 211, the tag array 205 performs a look-up on the cache line's system memory address to identify which pair of super lines the cache line's address maps to. As can be seen in FIG. 2b, the address of the first cache line maps to logical super line pair 206.

Here, many different cache line system memory addresses (beyond two super lines worth) may map to a particular pair of logical super lines. As such, multiple cache lines within the system may compete for the same near memory cache resources. Here, as just one example, system memory addresses of incoming cache line requests may be hashed by hashing logic of the tag array to identify which particular logical super line pair the cache line maps to.
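As one way to picture such hashing, the Python sketch below XOR-folds a cache line number down to an index that selects a logical super line pair. The text does not specify a hash function, so the XOR-fold, the 64 byte line size used to form the line number, and the name `super_line_pair_index` are all assumptions made for illustration.

```python
CACHE_LINE_BYTES = 64  # nominal cache line size used in the examples

def super_line_pair_index(system_address, index_bits):
    """XOR-fold the cache line number into index_bits bits (index_bits >= 1).

    This is a stand-in hash chosen for simplicity, not a described design.
    """
    line_number = system_address // CACHE_LINE_BYTES
    mask = (1 << index_bits) - 1
    index = 0
    while line_number:
        index ^= line_number & mask
        line_number >>= index_bits
    return index
```

Because many line numbers fold to the same index, distinct cache lines can land on the same logical super line pair and compete for its physical super line, which is the competition described above.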

With the write request being directed to a particular logical super line pair 206, management logic circuitry associated with the pair (not shown in FIG. 2b for illustrative ease) determines which cache line location (slot) of the pair of logical super lines 206 the incoming write request is to be associated with. In various embodiments the management logic circuitry keeps a “free list” that identifies which cache line slots within the pair of super lines 206 are free (are not placeholding/consumed on behalf of a cache line that is presently stored in the physical super line 204 in near memory 203). In the present example, as discussed above, the near memory 203 is assumed to be empty as an initial condition. As such, the free list maintained by the management logic circuitry lists the entire storage capacity of both logical super lines 207_1, 207_2 of the pair 206 as being free.

Under more normal conditions, a number (perhaps all) of the cache line slots of the pair of logical super lines 206 are consumed (i.e., are placeholding for a cache line that is presently stored in the physical super line 204 in near memory 203). Under these circumstances, cache hit/miss logic circuitry that is a component of the management logic circuitry first determines if the system memory address of the write request is the same as one of the cache lines that is presently cached in near memory 203. Here, each cache line slot of the logical super line pair that is consumed/occupied includes meta-data that identifies the system memory address of the cache line that it is placeholding for. If the meta-data of an occupied slot includes the same system memory address as the incoming write request there is a cache hit (the cache line being targeted by the write request currently resides in the near memory super line 204).

If there is a cache hit, the write into the physical near memory super line 204 can proceed. Here, ideally, the new compressed cache line that results from the new write request is identical in size to, or is smaller in size than, the older compressed version of the cache line that is residing in the physical super line 204 in near memory 203. If either of these conditions is true, the new compressed cache line can be written directly over the old compressed cache line. If the new compressed cache line is larger than the old compressed cache line, various approaches may be undertaken to handle the situation.

According to one approach, a portion of the new compressed cache line that is commensurate in size with the older compressed cache line is written directly over the old compressed cache line in near memory 203 and the remaining portion is written into free space (identified on the aforementioned free list) within the physical super line 204 resources in near memory cache 203.

If such free space does not exist, the compressed cache line will not fit into the physical super line 204 in its current state. As such, another compressed cache line (e.g., a consumed cache line slot within the logical super line pair 206 whose corresponding cache line has been least recently used (LRU) from the perspective of the pair's management logic) can be evicted from the physical super line 204 in near memory 203 to make room for the new compressed cache line. If this approach is taken, the evicted cache line is written to far memory 209. Besides LRU, other cache replacement policies to select the evicted line can be used (e.g., least frequently used (LFU), etc.).

Alternatively, because the new compressed cache line does not fit into the physical super line 204 on its own, the management logic associated with logical super line pair 206 may choose not to cache the cache line associated with the write request and instead write the new updated cache line into far memory 209. In this case, the meta data kept in the cache line slot of the tag array that the cache line has mapped to is updated to reflect that the older version of the cache line that presently sits in near memory cache can be written over (the cache line is invalidated and free space has been created in the near memory).

In the present example, again, the near memory cache 203 is assumed to be empty. As such, the free list identifies all cache slots in the pair of logical super lines 206. The management logic picks one of these slots (as depicted in FIG. 2b, slot 211_1 is chosen). The near memory controller 202 then proceeds to compress and write the cache line into the near memory super line 204 cache resources.

The write process includes adding meta data for the slot 211_1 that the first cache line write request has been allocated to that: 1) identifies the system memory address of the first write request's cache line (e.g., its tag); 2) identifies where the compressed cache line can be found in the physical super line's worth of near memory cache resources 204 and the size of the compressed cache line. The former is used for determining subsequent cache hit/miss results for later cache line requests that map to logical super line pair 206. The latter is used to fetch the compressed cache line or determine the current state of the physical super line of caching resources 204 in near memory 203. In various embodiments, described in more detail below, the information that identifies where the compressed cache line is kept in the near memory super line 204 identifies a “starting position” of the compressed cache line in the near memory super line 204 and a “length” or size of the compressed cache line that extends from the starting position.
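The per-slot meta data and the free list can be pictured with the following Python sketch. It is a simplified model under stated assumptions: positions and lengths are counted in fixed-size blocks, new lines are always appended at the tail of the physical super line, and the names `Slot` and `LogicalSuperLinePair` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    tag: Optional[int] = None   # system memory address of the cached line
    start_block: int = 0        # "starting position" in the physical super line
    num_blocks: int = 0         # "length" of the compressed line, in blocks

class LogicalSuperLinePair:
    def __init__(self, num_slots, physical_blocks):
        self.slots = [Slot() for _ in range(num_slots)]
        self.free_slots = list(range(num_slots))   # the "free list"
        self.physical_blocks = physical_blocks
        self.tail = 0                              # next unused physical block

    def lookup(self, tag):
        """Return the slot index on a cache hit, or None on a miss."""
        for i, slot in enumerate(self.slots):
            if slot.tag is not None and slot.tag == tag:
                return i
        return None

    def allocate(self, tag, num_blocks):
        """Claim a free slot and append the line at the tail, if it fits."""
        if not self.free_slots or self.tail + num_blocks > self.physical_blocks:
            return None
        i = self.free_slots.pop(0)
        self.slots[i] = Slot(tag, self.tail, num_blocks)
        self.tail += num_blocks
        return i
```

A caller would run `lookup` first and fall back to `allocate` on a miss; handling of overwrites, eviction and bypass to far memory is deliberately left out of the sketch.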

In the present example of FIG. 2b, again, the near memory super line 204 is empty. As such, the cache line 211_2 associated with the first write request is written at the “head” of the super line 204. Here, as will become more evident in the following discussion, any cache line having a placeholder in the pair of logical super lines 206 can be written into any physical location space of the near memory super line 204. Although cache line slots are demarcated in the near memory super line 204, in various embodiments, they are presented more as reference markers than a hard delimiter as to boundaries of stored cache lines (although other embodiments may choose to not cross such boundaries when writing a cache line into the physical super line 204).

As will be described in more detail below, in order to physically store compressed cache lines into the near memory super line 204, the compressed cache lines and the physical architecture of near memory 203 are broken down into smaller chunks or “blocks”. For example, the amount of storage space for a single cache line in the physical near memory super line is broken down into six 11 byte blocks.

Likewise, a compressed cache line is composed of how many such 11 byte blocks are needed to keep all of the compressed cache line's information. For example, in the case of a nominal 64 byte cache line that is compressed into, e.g., 30 bytes of information, the cache line is broken down into three such 11 byte blocks. The three 11 byte blocks are then written into the near memory physical super line 204. As can be seen in FIG. 2b, the compressed cache line 211_2 associated with the first write request consumes approximately three such blocks as it consumes approximately one half of a cache line in the near memory super line 204.
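The block count follows from a ceiling division. The short Python sketch below assumes the two bytes of header information per stored cache line that the description of FIG. 4a through 4c gives; `blocks_needed` is a hypothetical helper name.

```python
BLOCK_BYTES = 11
HEADER_BYTES = 2   # per-line header, as in the description of FIGS. 4a-4c

def blocks_needed(compressed_bytes):
    """Number of 11 byte blocks needed for the header plus compressed data."""
    total = HEADER_BYTES + compressed_bytes
    return -(-total // BLOCK_BYTES)   # ceiling division on integers

print(blocks_needed(30), blocks_needed(64))
```

With these figures, 30 bytes of compressed data need three blocks and an uncompressed 64 byte line needs all six.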

FIG. 2c shows the write process for the write request of the second cache line. The write process for the second cache line follows the same write process as the first cache line with the exception that the first and second cache lines have different system memory addresses. As such, the management logic assigns the cache line associated with the second write request to a different cache line slot 212_1 within the logical super line pair 206. The second cache line is compressed and stored 212_2 in the near memory super line 204 at the tail of the first compressed cache line 211_2. The meta data of slot 212_1 is updated to reflect the system memory address of the second cache line, and its location and size in the near memory super line 204. As can be seen in FIG. 2c, the second compressed cache line 212_2 approximately consumes 5/6 of a cache line's worth of resources in the near memory cache (e.g., five 11 byte blocks). As such, the first two compressed cache lines together 211_2, 212_2 only consume approximately 4/3 of a single cache line's worth of storage resources in the near memory super line 204.

FIG. 2d shows the write process for the write request of the third cache line 213. Like the first two write requests 211, 212, a look-up is performed on the cache line's system memory address by the tag array logic circuitry 205 to see which pair of logical super lines the cache line's system memory address maps to. As observed in FIG. 2d, the third cache line 213 maps to logical super line pair 206. As with the other requests, the third cache line is assigned an unused slot 213_1 in the logical super line pair 206 and an attempt is made to compress the cache line and write it at the tail of the physical super line 204 in near memory 203. However, unlike the first two cache lines 211, 212, the third cache line 213 is not compressible. Here, some cache line data is such (e.g., extremely randomized) that compression algorithms are not able to significantly reduce the footprint size of the data.

As such, even though an attempt is made to compress the third cache line, its total size is not really reduced. In this case, the third cache line is split into six 11 byte blocks 213_2 which are written in succession at the tail of the second cache line 212_2 in the physical cache super line 204. As observed in FIG. 2d, the third cache line 213_2 approximately consumes an entire cache line's worth of space of the super line cache resources 204.

Here, in various embodiments, the near memory controller 202 includes logic circuitry 208 capable of performing different kinds of data compression to, e.g., use whichever of these will achieve the most compression for the cache line to be written into near memory. For example, in one embodiment, three different kinds of compression exist: 1) all zeros compression (in which no data is consumed in near memory at all; since the data to be compressed is all zeros, a statement to that effect is marked in the meta data of the cache line slot that the cache line maps to in the appropriate logical super line); 2) frequent pattern compression; 3) base-delta compression. For any cache line to be compressed, whichever of these three compression schemes yields the most compression is the compression scheme that is applied to the cache line. Other compression algorithms can also be used, like Huffman encoding, Lempel-Ziv (LZ) compression or any variation or combination of these algorithms.
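The "try each scheme, keep the smallest" selection can be sketched in Python as follows. This is a heavily simplified illustration: frequent pattern compression is omitted, the base-delta variant shown (one 8 byte base plus one signed 1 byte delta per 8 byte word) is just one assumed configuration, and the function names are hypothetical.

```python
def all_zero_size(line):
    """0 bytes if the line is all zeros, else None (scheme not applicable)."""
    return 0 if not any(line) else None

def base_delta_size(line, word=8, delta=1):
    """Size of a simplified base-delta encoding, or None if deltas overflow."""
    words = [int.from_bytes(line[i:i + word], "little")
             for i in range(0, len(line), word)]
    base = words[0]
    limit = 1 << (8 * delta - 1)          # signed range of one delta
    if all(-limit <= w - base < limit for w in words):
        return word + delta * len(words)  # one base plus one delta per word
    return None

def pick_scheme(line):
    """Return (scheme name, size in bytes) for the smallest applicable scheme."""
    candidates = {
        "zeros": all_zero_size(line),
        "base-delta": base_delta_size(line),
        "uncompressed": len(line),        # always applicable fallback
    }
    viable = {name: size for name, size in candidates.items() if size is not None}
    return min(viable.items(), key=lambda item: item[1])
```

A line whose eight words hold closely spaced values (e.g., 1000 through 1007) would select base-delta in this sketch, whereas data with no such regularity falls back to being stored uncompressed, as with the third cache line above.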

Here, the metadata that is kept for a cache line in the logical super line placeholder slots may also include identification of which compression technique has been used. In an alternative or combined approach, this information is stored within one of the 11 byte blocks that are physically written into the near memory super line 204 so that it can be used by decompression logic that will decompress the cache line if/when it is read from the near memory super line 204.

FIGS. 2e and 2f depict the processing of the write requests for the fourth and fifth cache lines 214, 215 respectively. As observed in FIG. 2f, when the fifth compressed cache line 215_2 is finally written into the near memory super line 204, all five cache lines have collectively been compressed down to an amount of information that only consumes about 2.5 cache lines' worth of near memory super line resources.

FIG. 3 shows an embodiment of standard processing for a read request. Here, a read request is received 1 for the fifth cache line 215_2 (that was written during the processing of FIG. 2f). A look-up is performed on the read request's associated system memory address by the tag array logic circuitry 305 which directs the request to logical super line pair 206.

The management logic circuitry of the super line pair 206 then proceeds to analyze the meta data of its placeholders and detects a cache line hit with the meta data of slot 215_1. As discussed above, the meta data also contains information (e.g., starting position and length) that identifies where the physical cache line is kept in the near memory super line 204. The cache line 215_2 is then read 3 from the near memory super line 204. Here, recognizing that the fifth cache line 215_2 was compressed to about half a cache line's worth of information, the reading of the cache line 215_2 entails reading the three (or four) 11 byte blocks that were consumed in the near memory super line to store the fifth cache line 215_2. The read cache line is then decompressed 4 by decompression logic circuitry and forwarded 5 as the response to the initial read request.

FIGS. 4a through 4d show various depictions that illustrate an embodiment for the technology associated with breaking a cache line down into 11 byte blocks so that storage of compressed cache lines can be effected.

FIG. 4a depicts the basic structure of a cache line's worth of data stored in the physical near memory super line 204 of FIGS. 2a through 2f. As can be seen in FIG. 4a, a cache line's worth of data is stored as six 11 byte blocks 401_1 through 401_6. Here, assuming a nominal cache line is composed of 64 bytes, six 11 byte blocks correspond to a total of 66 bytes.

FIG. 4b shows two compressed cache lines being stored in a cache line's worth of data. As observed in FIG. 4b, in various embodiments, each cache line that is cached in the near memory super line includes a combination of “header” information (e.g., to indicate what type of compression has been applied to the cache line's data) and compressed data in the first 11 byte block and compressed data in the cache line's remaining 11 byte blocks. Here, FIG. 4b shows two cache lines 411_1, 411_2 that have been compressed by approximately 50% and therefore can be stored in the storage space of a single cache line's worth of data.

In the case of an uncompressed cache line, two bytes of header information plus 64 bytes of uncompressed cache line data corresponds to 66 bytes. Hence, an uncompressed cache line (akin to a worst case scenario) can “fit” into the six 11 byte blocks of FIG. 4a. As a cache line becomes more and more compressed, however, it will consume fewer and fewer 11 byte blocks in order to completely store all of its information. For example, a cache line that is approximately reduced by 75% (e.g., to 16 bytes of data) will only need two 11 byte blocks to have both its compressed customer data and its header information fully stored. Thus the breaking down of a cache line (and architecting near memory) into multiple 11 byte blocks allows for storage of the minimum number of such blocks per cache line in physical near memory. In this manner, the 11 byte block scheme provides for the physical realization of the compression (less near memory physical storage space is consumed as more data is able to be compressed).

FIG. 4c shows an embodiment of the structure of an 11 byte block for a system that supports three types of 11 byte blocks: 1) blocks for an uncompressed cache line 421; 2) blocks for a cache line that is compressed according to a frequent pattern compression scheme 422; 3) blocks for a cache line that is compressed according to a base-delta compression scheme 423. Recall that in a same system there may be another compression option in which the cache line data contains all zeros (or all 1s) and therefore the cache line's data can be stored in its corresponding meta data of the logical super line pair (no data is stored in near memory, just a note in the meta data that the cache line contains all 1s or all 0s).

As observed in FIG. 4c (and as alluded to above with respect to FIG. 4b), there are two types of blocks: 1) blocks that store header information and cache line data 424; and, 2) blocks that store only cache line data 425. Header blocks 424 are observed to include header information in bytes 10 and 9 and cache line data in bytes 8 through 0. The header information uses a portion of byte 10 to identify the type of compression applied (“uncompressed” for an uncompressed block, “FP” for a block whose cache line data is compressed according to a frequent pattern compression scheme and “BDI” for a block whose cache line data is compressed according to a “Base-Delta” compression scheme). The header information also uses the remaining portion of byte 10 and all of byte 9 for Error Correction Code (ECC) information. Blocks 425 that store only cache line data store cache line data in bytes 10 through 0.
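A minimal sketch of this block layout follows. The specification states only that a portion of byte 10 identifies the compression type and that the remainder of byte 10 plus byte 9 hold ECC; the 2 bit type field, the numeric type codes, the resulting 14 bit ECC width and the descending byte fill order used here are all assumptions made for illustration, and the function names are hypothetical.

```python
BLOCK_BYTES = 11
TYPE_BITS = 2                      # assumed width of the compression type field
ECC_BITS = 16 - TYPE_BITS          # remainder of bytes 10 and 9 (assumed 14 bits)
TYPE_CODES = {"uncompressed": 0, "FP": 1, "BDI": 2}  # assumed encoding


def split_into_blocks(comp_type: str, ecc: int, data: bytes) -> list:
    """Build one header block (type + ECC + 9 data bytes) and data-only blocks."""
    if not 0 <= ecc < (1 << ECC_BITS):
        raise ValueError("ECC value does not fit the assumed field width")
    header = (TYPE_CODES[comp_type] << ECC_BITS) | ecc
    first = bytearray(BLOCK_BYTES)   # index i of the array models byte i
    first[10] = header >> 8          # type bits and upper ECC bits
    first[9] = header & 0xFF         # lower ECC bits
    for i, value in enumerate(data[:9]):
        first[8 - i] = value         # data fills bytes 8 down to 0 (assumed order)
    blocks = [first]
    rest = data[9:]
    for offset in range(0, len(rest), BLOCK_BYTES):
        block = bytearray(BLOCK_BYTES)
        for i, value in enumerate(rest[offset:offset + BLOCK_BYTES]):
            block[10 - i] = value    # data-only blocks use bytes 10 down to 0
        blocks.append(block)
    return blocks


def parse_header(first_block: bytearray) -> tuple:
    """Recover (compression type name, ECC bits) from a header block."""
    header = (first_block[10] << 8) | first_block[9]
    names = {code: name for name, code in TYPE_CODES.items()}
    return names[header >> ECC_BITS], header & ((1 << ECC_BITS) - 1)


blocks = split_into_blocks("FP", 0x1234, bytes(range(16)))
print(len(blocks), parse_header(blocks[0]))
```

With these assumed widths, a 64 byte uncompressed line occupies nine data bytes in the header block and 55 bytes in five data-only blocks, matching the six blocks of FIG. 4a.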

FIG. 4d shows an exemplary near memory controller 402 and near memory design 412 for implementing the near memory caching architecture described at length above. Here, in the case of a write operation, a write request is received into the tag array and logical super line pair logic circuitry. The tag array circuitry associates the request's system memory address with a particular set (pair) of logical super lines. Management logic circuitry associated with the pointed-to set/pair of logical super lines performs a cache hit/miss operation and allocates a cache line slot location for the request if a cache miss occurs and space exists in the near memory cache super line.

Additionally, in various embodiments, if no space initially exists for the cache line, an older one or more cache lines (e.g., least recently used cache line(s)) are evicted from the near memory super line to make room for the new cache line. The evicted cache line(s) are then written back to far memory. If a cache hit exists, an attempt is made to write the cache line over the existing cache line in near memory. Another option is to write the new cache line directly to far memory, and keep the status of the new cache line in the tag array as a “miss”. The same option can be used if the new cache line was previously in near memory, but the new data to be written to it is uncompressible (or compressible only to a larger size), and there is no place for the new cache line. In this case the memory controller may decide to write the cache line to far memory, and mark it as invalid in near memory.

The write data is received from a data bus and compressed by compression logic circuitry 404. Here, the compression logic circuitry 404 may determine which of multiple compression options yields the best compression. The compression logic circuitry informs the logical super line pair circuitry 405 of the amount of compression so that the circuitry 405 can determine information that defines where the cache line will be stored in the near memory cache super line (e.g., the starting location and length of the cache line within the near memory cache super line).

Additionally, the amount of compression is used by the circuitry 405 to determine how many older compressed cache lines need to be evicted from the near memory cache super line in order to make room for the new cache line. Eviction is needed in the case of a cache miss if there is not enough space available in the physical near memory super line, or, in the case of a cache hit, if the new compressed cache line is larger than the older version of the cache line that currently resides in near memory.

For example, if the new cache line is compressed by 50% but the least recently used cache line (in the case of a cache miss) or the older version of the cache line (in the case of a cache hit) was compressed by 75% (so that the actual data footprint in near memory is 25% of a full cache line), at least one more cache line will need to be evicted from the near memory cache super line in order to make room for the new cache line. In various embodiments, the newly compressed cache line has to be fragmented and stored non-contiguously (not as a single cohesive piece) in the near memory super line. In this case, the meta data for the cache line is expanded to indicate more than one starting location and corresponding length that extends from each starting location (so that there is one starting location and length for each different piece of the cache line that is stored in the near memory super line).
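The eviction count discussed above can be sketched as a simple loop. This is only an illustration under assumed simplifications: space is counted in 11 byte blocks, candidate victims are supplied in least recently used order, and the function name `lines_to_evict` is hypothetical. It does not model whether the freed blocks are contiguous, which is the situation the fragmented storage described above addresses.

```python
def lines_to_evict(blocks_required: int, free_blocks: int, lru_footprints: list) -> int:
    """Count how many least recently used lines must be evicted to fit a new line.

    lru_footprints lists the block footprint of each resident cache line,
    ordered from least to most recently used (an assumed representation).
    """
    available = free_blocks
    evicted = 0
    for footprint in lru_footprints:
        if available >= blocks_required:
            break
        available += footprint   # evicting this line frees its blocks
        evicted += 1
    if available < blocks_required:
        raise ValueError("super line cannot hold the new cache line")
    return evicted


# A new line needing 4 blocks cannot reuse the 2 blocks of a single
# heavily compressed victim, so a second line is evicted as well.
print(lines_to_evict(4, 0, [2, 2, 3]))
```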

The compression logic circuitry 404 may also inform the logical super line pair management logic circuitry 405 of the type of compression being applied. The starting location, length and type of compression information meta data may be kept with meta data that identifies the cache line's system memory address and whether the cache line is valid or not (the cache line will be valid as written but, as is understood in the art, subsequent events can cause a cache line to be marked as invalid).

After compression, the compressed write data is provided to ECC generation logic circuitry 406 which generates ECC information for inclusion into the cache line's header information. The header information (composed of information that identifies which type of compression has been applied and the ECC information) and the compressed write data are presented to the write portion of read/write circuitry 407 and parsed into the appropriate number of 11 byte blocks for writing into memory.

Here, in the basic case where the cache line to be written is not fragmented, the logical super line management circuitry 405 converts the system memory address of the write request into lower ordered bits of a near memory address and also converts the meta data information that describes where the cache line is to be stored and the size of the cache line in the physical near memory super line into higher ordered bits of the near memory address. The write portion converts the higher ordered memory address bits into bank select information for the selected memory banks 410 in near memory and writes the compressed cache line as one or more contiguous 11 byte blocks into the appropriate banks 410 of a particular physical cache super line within near memory.

FIG. 4d also shows an exemplary memory architecture 412 for implementing the near memory cache. It is important to emphasize that the memory architecture 412 of FIG. 4d is only exemplary and that other memory architectures may also be used. As can be seen in FIG. 4d, the physical memory 412 is organized in “words” of six 11B blocks. In the specific memory “slice” 412 of FIG. 4d, memory resources sufficient to store two such words are observable. Specifically, memory banks 410_0 through 410_5 (B0-B5) correspond to an upper (“odd”) word of six 11 byte blocks and memory banks 410_6 through 410_11 (B6-B11) correspond to a lower (“even”) word of six 11 byte blocks. In various embodiments, more than 6 multiple byte blocks may be used to form a word. Multiple slices like slice 412 will complete a memory structure that is capable of storing a super line.

In an embodiment, the cache line being stored, whether compressed or not, can start at any 11B block. That is, a cache line need not start at a word boundary. As such there are two possibilities. According to a first possibility, the written cache line is entirely written into a single word of the memory resources. In this case, during a read operation, the memory word that the cache line is contained within is read from the memory. If the cache line was not aligned to the word boundary, shift circuitry that precedes the ECC and decompression circuits along the read path circuitry will additionally shift the read data to align it to an edge of the read data path. In the case of a write operation, if the cache line is not aligned to the word boundary, shift circuitry along the write data path will shift the data from the edge of the write data path to align it to the correct block in the memory word prior to writing the line into the correct blocks of the word.

According to a second possibility, the cache line is split over two adjacent words in the memory. In the case of a read operation, two words of data are read from the two memory words that the cache line is stored within. The shifting circuitry then shifts the first portion of the cache line to align to the read data path edge and shifts the second portion of the cache line to the tail of the first portion. In the case of a write, after compression and ECC, the cache line is split along the write data path for writing into two different memory words. The correct blocks of the two memory words are then written to with the corresponding portions of the cache line. Here, with the second possibility (cache split) the memory is designed to permit concurrent reading/writing from/to two memory words.
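The two possibilities can be illustrated with a short sketch that, given a starting block and a length in blocks, reports which memory word(s) are touched. It assumes blocks are numbered linearly across adjacent six block words; the function name `word_segments` is hypothetical, and the actual bank selection and shift circuitry are not modeled.

```python
BLOCKS_PER_WORD = 6   # each memory word holds six 11 byte blocks


def word_segments(start_block: int, num_blocks: int) -> list:
    """Return (word index, first block within word, block count) per touched word."""
    if start_block < 0 or not 1 <= num_blocks <= BLOCKS_PER_WORD:
        raise ValueError("invalid start block or length")
    word, offset = divmod(start_block, BLOCKS_PER_WORD)
    first_count = min(num_blocks, BLOCKS_PER_WORD - offset)
    segments = [(word, offset, first_count)]
    if first_count < num_blocks:
        # Second possibility: the tail of the line spills into the next word.
        segments.append((word + 1, 0, num_blocks - first_count))
    return segments


print(word_segments(4, 2))   # first possibility: contained in a single word
print(word_segments(4, 4))   # second possibility: split over two adjacent words
```

In this model the offset of the first segment corresponds to the shift needed to align the line with the data path edge, and the second segment, when present, is what would be appended to the tail of the first portion on a read.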

In various embodiments, the address space of each bank B0-B11 defines how many physical near memory super lines exist in near memory. For example, if each bank is implemented as a memory chip having an 11 byte data bus and a 10 bit address bus, then 1024 different super lines can physically exist in near memory (2¹⁰=1,024). The number of physical super lines that can exist in near memory defines the ratio of logical super lines to physical super lines in the system for a particular system memory address size (e.g., which may be defined by the size of far memory). The particular memory address that the enabled bank(s) receive during a write process is determined from the conversion of the write request's system memory address to the lower ordered bits of the near memory address.

A read process includes the reception of a read request which is routed to the tag array logic circuitry. The tag array logic circuitry associates the system memory address of the read request with a particular set of logical super lines. Management circuitry associated with the pointed-to set of logical super lines determines whether a cache hit exists (that is, that there exists meta data for a stored cache line in near memory having the same system memory address as the read request).

If a cache hit results, the read request's system memory address and the meta data of the tag array that identifies the starting location and size of the requested cache line in near memory are converted into a near memory address as described just above for a write request. The requested cache line is then fetched from near memory and its ECC information is checked by ECC circuitry 409. If there is an error in the read data that cannot be corrected with the ECC protection, an error flag is raised. Otherwise, the read cache line is decompressed by decompression logic circuitry 408 and provided as the read request response.

If a cache miss results, the read request is re-directed to far memory. The cache line that is read from far memory is provided as the read request response. This cache line may also be routed to the near memory controller 402 and “filled back” into near memory cache according to the same process described above for a write request (albeit where a check is not made for a cache hit and instead the cache line is forced into near memory).

Although the preceding examples have emphasized the use of an 11 byte block for the deconstruction/reconstruction of a cache line, those of ordinary skill will understand that this is a designer's choice and other block lengths are possible (e.g., to achieve larger or smaller header fields per block). For instance, any byte length between 10 bytes and 30 bytes per block may be sufficient. Here, it is pertinent to point out that the term “block” as associated with the smaller units that a cache line is broken down to is not to be confused with the much larger “block” or “sector” of information that has traditionally been used to refer to the unit of data that is stored in mass storage such as a disk drive.

FIG. 5 shows the methodology described above. The method includes receiving a read or write request for a cache line 501. The method includes directing the request to a set of logical super lines based on the cache line's system memory address 502. The method includes associating the request with a cache line of the set of logical super lines 503. The method includes, if the request is a write request: compressing the cache line to form a compressed cache line, breaking the cache line down into smaller data units and storing the smaller data units into a memory side cache 504. The method includes, if the request is a read request: reading smaller data units of the compressed cache line from the memory side cache and decompressing the cache line 505.
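The flow of FIG. 5 can be mimicked in software with a toy model. This is a loose sketch, not the described hardware: `zlib` stands in for the frequent pattern and base-delta schemes, set selection by address modulo is an assumed mapping, and header, ECC, eviction and far memory handling are omitted. The class name `ToyMemorySideCache` is hypothetical.

```python
import zlib

BLOCK_BYTES = 11   # size of the smaller data units
LINE_BYTES = 64    # nominal cache line size


class ToyMemorySideCache:
    """Toy model: compress on write, store as 11 byte units, decompress on read."""

    def __init__(self, num_sets: int = 4):
        self.num_sets = num_sets
        # Each set maps a system memory address to its list of stored units.
        self.sets = [dict() for _ in range(num_sets)]

    def _set_for(self, address: int) -> dict:
        # Direct the request to a set based on the address (assumed mapping).
        return self.sets[(address // LINE_BYTES) % self.num_sets]

    def write(self, address: int, line: bytes) -> int:
        if len(line) != LINE_BYTES:
            raise ValueError("a cache line must be 64 bytes in this model")
        compressed = zlib.compress(line)   # stand-in compressor, not FP or BDI
        units = [compressed[i:i + BLOCK_BYTES]
                 for i in range(0, len(compressed), BLOCK_BYTES)]
        self._set_for(address)[address] = units
        return len(units)                  # number of units consumed

    def read(self, address: int):
        units = self._set_for(address).get(address)
        if units is None:
            return None                    # miss: would be served by far memory
        return zlib.decompress(b"".join(units))


cache = ToyMemorySideCache()
cache.write(0x1000, bytes(LINE_BYTES))
print(cache.read(0x1000) == bytes(LINE_BYTES), cache.read(0x2000))
```

Because `zlib` can expand incompressible input, a fuller model would fall back to the uncompressed block type in that case, as the described embodiment does.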

FIG. 6 provides an exemplary depiction of a computing system 600 (e.g., a smartphone, a tablet computer, a laptop computer, a desktop computer, a server computer, etc.). As observed in FIG. 6, the basic computing system 600 may include a central processing unit 601 (which may include, e.g., a plurality of general purpose processing cores 615_1 through 615_X) and a main memory controller 617 disposed on a multi-core processor or applications processor, system memory 602, a display 603 (e.g., touchscreen, flat-panel), a local wired point-to-point link (e.g., USB) interface 604, various network I/O functions 605 (such as an Ethernet interface and/or cellular modem subsystem), a wireless local area network (e.g., WiFi) interface 606, a wireless point-to-point link (e.g., Bluetooth) interface 607 and a Global Positioning System interface 608, various sensors 609_1 through 609_Y, one or more cameras 610, a battery 611, a power management control unit 612, a speaker and microphone 613 and an audio coder/decoder 614.

An applications processor or multi-core processor 650 may include one or more general purpose processing cores 615 within its CPU 601, one or more graphical processing units 616, a memory management function 617 (e.g., a memory controller) and an I/O control function 618. The general purpose processing cores 615 typically execute the operating system and application software of the computing system. The graphics processing unit 616 typically executes graphics intensive functions to, e.g., generate graphics information that is presented on the display 603. The memory control function 617 interfaces with the system memory 602 to write/read data to/from system memory 602. The power management control unit 612 generally controls the power consumption of the system 600.

The touchscreen display 603, the communication interfaces 604-607, the GPS interface 608, the sensors 609, the camera(s) 610, and the speaker/microphone codec 613, 614 can all be viewed as various forms of I/O (input and/or output) relative to the overall computing system including, where appropriate, an integrated peripheral device as well (e.g., the one or more cameras 610). Depending on implementation, various ones of these I/O components may be integrated on the applications processor/multi-core processor 650 or may be located off the die or outside the package of the applications processor/multi-core processor 650.

The computing system may also include a memory system, such as system memory (also referred to as main memory) implemented with a memory controller that maintains meta data for logical super lines so that cache lines of information can be compressed prior to their being kept in a memory side cache for main memory as described at length above.

Application software, operating system software, device driver software and/or firmware executing on a general purpose CPU core (or other functional block having an instruction execution pipeline to execute program code) of an applications processor or other processor may perform any of the functions described above.

Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific hardware components that contain hardwired logic for performing the processes, or by any combination of programmed computer components and custom hardware components.

Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, and magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media or other type of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).

In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

What is claimed is:
 1. An apparatus, comprising: processor logic circuitry to execute program code; a multi-level memory comprising a first level of memory and a second level of memory, the first level of memory having lower access times than the second level of memory, the first level of memory being composed of a different type of memory technology than the second level of memory, the first level of memory being integrated on a same semiconductor chip as the processor logic circuitry, the second level of memory residing outside the semiconductor chip that the processor logic circuitry and the first level of memory are integrated on, the second level of memory being composed of dynamic random access memory (DRAM), wherein, the first level of memory is to have a first portion that is to function as a cache for the processor logic circuitry and a second portion that is to have a region of address space that is addressable by the program code, and wherein, the second level of memory is also addressable by the program code; and, compression circuitry to compress data items that are entered into the first level of memory to enhance the first level of memory's memory capacity, the compression circuitry designed to compress two lines of data into one line of data.
 2. The apparatus of claim 1 wherein the first level of memory is composed of SRAM.
 3. The apparatus of claim 1 wherein the compression circuitry is designed to compress to at least one compression ratio other than 2:1.
 4. The apparatus of claim 3 wherein the compression circuitry is designed to compress at a ratio of 4:1.
 5. The apparatus of claim 1 wherein the first and second levels of memory are capable of storing graphics information.
 6. An apparatus, comprising: processor logic circuitry to execute program code; a controller to control a multi-level memory comprising a first level of memory and a second level of memory, the first level of memory having lower access times than the second level of memory, the first level of memory being composed of a different type of memory technology than the second level of memory, the first level of memory being integrated on a same semiconductor chip as the processor logic circuitry, the second level of memory residing outside the semiconductor chip that the processor logic circuitry and the first level of memory are integrated on, the second level of memory being composed of dynamic random access memory (DRAM), wherein, the first level of memory is to have a first portion that is to function as a cache for the processor logic circuitry and a second portion that is to have a region of address space that is addressable by the program code, and wherein, the second level of memory is also addressable by the program code; and, compression circuitry to compress data items that are entered into the first level of memory to enhance the first level of memory's memory capacity, the compression circuitry designed to compress two lines of data into one line of data.
 7. The apparatus of claim 6 wherein the first level of memory is composed of SRAM.
 8. The apparatus of claim 6 wherein the compression circuitry is designed to compress to at least one compression ratio other than 2:1.
 9. The apparatus of claim 8 wherein the compression circuitry is designed to compress at a ratio of 4:1.
 10. The apparatus of claim 6 wherein the first and second levels of memory are capable of storing graphics information.
 11. A method, comprising: executing program code with processor logic circuitry; controlling a multi-level memory comprising a first level of memory and a second level of memory, the first level of memory having lower access times than the second level of memory, the first level of memory being composed of a different type of memory technology than the second level of memory, the first level of memory being integrated on a same semiconductor chip as the processor logic circuitry, the second level of memory residing outside the semiconductor chip that the processor logic circuitry and the first level of memory are integrated on, the second level of memory being composed of dynamic random access memory (DRAM), wherein, the first level of memory has a first portion that functions as a cache for the processor logic circuitry and a second portion that has a region of address space that is addressable by the executing program code, and wherein, the second level of memory is also addressable by the executing program code; and, compressing data items that are entered into the first level of memory to enhance the first level of memory's memory capacity, the compressing comprising compressing two lines of data into one line of data.
 12. The method of claim 11 wherein the first level of memory is composed of SRAM.
 13. The method of claim 11 wherein the compression circuitry is designed to compress to at least one compression ratio other than 2:1.
 14. The method of claim 13 wherein the compression circuitry is designed to compress at a ratio of 4:1.
 15. The method of claim 11 wherein the first and second levels of memory are capable of storing graphics information.